test: use default agent preset for integration tests by xingyaoww · Pull Request #1996 · OpenHands/software-agent-sdk

xingyaoww · 2026-02-11T09:23:25Z

Summary

Update integration tests to use the default agent preset from openhands.tools.preset.default, ensuring tests validate the same agent configuration shipped to production (GUI/CLI).

Changes:

tests/integration/base.py: Updated BaseIntegrationTest to:
- Import and use get_default_tools() from the preset
- Add enable_browser property (default False) for tests to override
- Make tools property non-abstract with default implementation using preset
Integration tests (t01-t08): Removed manual tool registration, now using default preset:
- t01_fix_simple_typo.py
- t02_add_bash_hello.py
- t03_jupyter_write_file.py
- t04_git_staging.py
- t05_simple_browsing.py (with enable_browser=True)
- t06_github_pr_browsing.py (with enable_browser=True)
- t07_interactive_commands.py
- t08_image_file_viewing.py
Behavior tests: Updated default_behavior_tools() in behavior_helpers.py to use preset
b05_do_not_create_redundant_files.py: Removed manual tool registration

Net change: -85 lines of code

This ensures integration tests validate the production agent configuration, catching any issues before they reach users.

Fixes #372

Checklist

If the PR is changing/adding functionality, are there tests to reflect this?
If there is an example, have you run the example to make sure that it works?
If there are instructions on how to run the code, have you followed the instructions and made sure that it works?
If the feature is significant enough to require documentation, is there a PR open on the OpenHands/docs repository with the same branch name?
Is the github CI passing?

@xingyaoww can click here to continue refining the PR

Agent Server images for this PR

• GHCR package: https://github.com/OpenHands/agent-sdk/pkgs/container/agent-server

Variants & Base Images

Variant	Architectures	Base Image	Docs / Tags
java	amd64, arm64	`eclipse-temurin:17-jdk`	Link
python	amd64, arm64	`nikolaik/python-nodejs:python3.13-nodejs22-slim`	Link
golang	amd64, arm64	`golang:1.21-bookworm`	Link

Pull (multi-arch manifest)

# Each variant is a multi-arch manifest supporting both amd64 and arm64
docker pull ghcr.io/openhands/agent-server:1f4e52c-python

Run

docker run -it --rm \
  -p 8000:8000 \
  --name agent-server-1f4e52c-python \
  ghcr.io/openhands/agent-server:1f4e52c-python

All tags pushed for this build

ghcr.io/openhands/agent-server:1f4e52c-golang-amd64
ghcr.io/openhands/agent-server:1f4e52c-golang_tag_1.21-bookworm-amd64
ghcr.io/openhands/agent-server:1f4e52c-golang-arm64
ghcr.io/openhands/agent-server:1f4e52c-golang_tag_1.21-bookworm-arm64
ghcr.io/openhands/agent-server:1f4e52c-java-amd64
ghcr.io/openhands/agent-server:1f4e52c-eclipse-temurin_tag_17-jdk-amd64
ghcr.io/openhands/agent-server:1f4e52c-java-arm64
ghcr.io/openhands/agent-server:1f4e52c-eclipse-temurin_tag_17-jdk-arm64
ghcr.io/openhands/agent-server:1f4e52c-python-amd64
ghcr.io/openhands/agent-server:1f4e52c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-amd64
ghcr.io/openhands/agent-server:1f4e52c-python-arm64
ghcr.io/openhands/agent-server:1f4e52c-nikolaik_s_python-nodejs_tag_python3.13-nodejs22-slim-arm64
ghcr.io/openhands/agent-server:1f4e52c-golang
ghcr.io/openhands/agent-server:1f4e52c-java
ghcr.io/openhands/agent-server:1f4e52c-python

About Multi-Architecture Support

Each variant tag (e.g., 1f4e52c-python) is a multi-arch manifest supporting both amd64 and arm64
Docker automatically pulls the correct architecture for your platform
Individual architecture tags (e.g., 1f4e52c-python-amd64) are also available if needed

Update integration tests to use the default agent preset from openhands.tools.preset.default, ensuring tests validate the same agent configuration shipped to production (GUI/CLI). Changes: - Update BaseIntegrationTest to use get_default_tools() by default - Add enable_browser property (default False) for tests to override - Remove manual tool registration from t01-t08 tests - Update behavior test helpers to use default preset - Add enable_browser=True for browsing tests (t05, t06) This ensures integration tests validate the production agent configuration, catching any issues before they reach users. Fixes #372 Co-authored-by: openhands <openhands@all-hands.dev>

all-hands-bot

🟡 Acceptable - Good Refactoring with Minor Design Smell

Verdict: ✅ Worth merging - Solid refactoring that removes duplication and aligns tests with production config.

Key Insight: The -85 line reduction and single source of truth for tool configuration is excellent. The enable_browser flag works but represents a scaling problem for future tool configurations.

See inline comments for specific feedback.

tests/integration/base.py

tests/integration/tests/t05_simple_browsing.py

tests/integration/utils/behavior_helpers.py

openhands-ai · 2026-02-11T09:33:18Z

Looks like there are a few issues preventing this PR from being merged!

GitHub Actions are failing:
- PR Review by OpenHands

If you'd like me to help, just leave a comment, like

@OpenHands please fix the failing actions on PR #1996 at branch `openhands/use-default-preset-for-integration-tests`

Feel free to include any additional details that might help me get this PR into a better state.

_{^{You can manage your notification settings}}

…-tests

enyst

Unless you'd like to negotiate with the reviewer agent, LGTM 😅

github-actions · 2026-02-11T22:31:11Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

…-tests

github-actions · 2026-02-13T10:57:48Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

xingyaoww · 2026-02-15T09:39:57Z

@OpenHands please check the github workflow in this PR https://github.com/OpenHands/software-agent-sdk/actions/runs/21925641027/job/63317348622

Tell me why the integration test in this PR never finish running?

openhands-ai · 2026-02-15T09:40:05Z

I'm on it! xingyaoww can track my progress at all-hands.dev

openhands-ai · 2026-02-15T09:45:24Z

Analysis: Why the Integration Test Never Finishes

After analyzing the GitHub workflow logs and the PR changes, I've identified the root cause of why the integration tests are hanging for 6+ hours before being cancelled.

Root Cause

The PR changes t05_simple_browsing.py and t06_github_pr_browsing.py to use browser tools by setting enable_browser=True, which includes the full BrowserToolSet (14 browser-related tools).

The problem is:

Original design: These tests were designed to work with terminal tools only (curl/wget). The test instruction "Browse localhost:8000, and tell me the ultimate answer to life" was meant to be completed using terminal commands.
PR change: By setting enable_browser=True, the tests now include browser tools. The agent attempts to use actual browser automation (Chromium via Playwright) to complete the task.
Why it hangs:
- Each browser action has a 300-second (5-minute) timeout in BrowserToolExecutor.__call__
- The test has max_iteration_per_run=100 iterations
- Worst case: 100 iterations × 300 seconds = 8.3 hours per test
- The job was cancelled after ~6 hours, which aligns with this calculation
Additional factors:
- Browser operations are inherently slower than terminal commands
- The agent may struggle to complete the task efficiently with browser tools
- The test wasn't designed for browser-based interaction (the HTML page requires clicking a button to reveal the answer)

Evidence from Logs

Job started: 2026-02-11T22:31:34Z
Job cancelled: 2026-02-12T04:36:23Z (~6 hours later)
Step "Run integration test evaluation" was stuck in in_progress state
No output logs were captured for the test execution step

Recommended Fix

The tests t05_simple_browsing.py and t06_github_pr_browsing.py should NOT set enable_browser=True. They should use the default enable_browser=False and rely on terminal tools as originally designed.

Alternatively, if browser testing is desired:

Create separate browser-specific tests with appropriate timeouts
Reduce max_iteration_per_run for browser tests
Add per-test timeouts to prevent runaway execution

View full conversation

github-actions · 2026-02-15T09:50:31Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

Co-authored-by: openhands <openhands@all-hands.dev>

github-actions · 2026-02-15T11:46:54Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-02-27T21:27:19Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

github-actions · 2026-03-06T12:32:56Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

xingyaoww · 2026-03-06T13:46:53Z

@OpenHands review https://github.com/OpenHands/software-agent-sdk/actions/runs/22763604507/job/66025865386?pr=1996 - why it gets stuck there? I think we need to setup chromium for the integration test too, similar to the test-examples tests

openhands-ai · 2026-03-06T13:47:14Z

I'm on it! xingyaoww can track my progress at all-hands.dev

Browser-enabled integration tests (t05, t06) require Chromium to be available. Without a system-installed Chromium, the browser tool's auto-install via uvx playwright hangs or installs without system deps, causing the entire test run to get stuck indefinitely. This mirrors the Chromium setup already present in run-examples.yml. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-03-06T13:50:34Z

Summary

Problem: The integration tests in PR #1996 were getting stuck indefinitely because the integration-runner.yml workflow lacked Chromium installation. Tests t05_simple_browsing.py and t06_github_pr_browsing.py enable browser tools (enable_browser=True), which require Chromium. Without it, the browser tool's auto-install fallback (uvx playwright install chromium --with-deps) either hangs or installs without proper system dependencies, causing the entire test run to stall.

Fix: Added an "Install Chromium" step to the run-integration-tests job in .github/workflows/integration-runner.yml, mirroring the identical setup already present in run-examples.yml:

- name: Install Chromium
  run: |
      sudo apt-get update
      sudo apt-get install -y chromium-browser

Status: The change (1 file, +5 lines) has already been committed and pushed to the openhands/use-default-preset-for-integration-tests branch as commit b96e8bf8.

github-actions · 2026-03-06T14:04:53Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

xingyaoww · 2026-03-06T16:41:09Z

@OpenHands please check the workflow https://github.com/OpenHands/software-agent-sdk/actions/runs/22766812391/job/66036717130?pr=1996 and the PR context, why it is still broken after adding chromium

openhands-ai · 2026-03-06T16:41:31Z

I'm on it! xingyaoww can track my progress at all-hands.dev

After browser-enabled tests (t05, t06) complete, the Chrome process and its async event loop remain alive in ProcessPoolExecutor workers because LocalConversation.close() is never explicitly called. This prevents the worker processes from exiting, causing the CI job to hang indefinitely. Call self.conversation.close() in teardown() to ensure all tool executors (including BrowserToolExecutor) are properly shut down after each test. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-03-06T16:48:24Z

Summary

Problem

The integration tests in PR #1996 were hanging indefinitely (3+ hours until GitHub Actions timeout) even after adding chromium-browser installation to the CI workflow.

Root Cause

The browser tests (t05, t06) were passing successfully — chromium was working fine. The hang occurred after all tests completed because:

BrowserToolExecutor launches a Chrome process with async event loops and background threads
After tests finish, BaseIntegrationTest.teardown() was empty — it never called LocalConversation.close()
The orphan Chrome process and its background threads kept ProcessPoolExecutor worker processes alive indefinitely

Evidence from the cancelled run's cleanup confirmed orphan chrome and chrome_crashpad_handler processes.

Fix (commit `f5a38f12`)

Added self.conversation.close() to BaseIntegrationTest.teardown() in tests/integration/base.py. This calls LocalConversation.close() which already properly shuts down all tool executors (including BrowserToolExecutor → kills Chrome, drains async events, closes the async executor).

Checklist

✅ Root cause identified and explained
✅ Minimal one-line fix — only added self.conversation.close() to the existing teardown() method
✅ Pre-commit hooks all pass (ruff format, ruff lint, pycodestyle, pyright, import checks)
✅ Changes committed and pushed to the PR branch

github-actions · 2026-03-06T17:11:51Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

all-hands-bot · 2026-03-12T12:23:11Z

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

all-hands-bot · 2026-03-17T12:29:42Z

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

all-hands-bot · 2026-03-23T12:30:39Z

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

all-hands-bot · 2026-03-29T12:22:25Z

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

all-hands-bot · 2026-04-03T12:29:23Z

[Automatic Post]: It has been a while since there was any activity on this PR. @xingyaoww, are you still working on it? If so, please go ahead, if not then please request review, close it, or request that someone else follow up.

…-tests

github-actions · 2026-04-04T05:18:34Z

Hi! I started running the integration tests on your PR. You will receive a comment with the results shortly.

xingyaoww · 2026-04-05T17:23:56Z

@OpenHands pls understand the context of this pr, then understand why https://github.com/OpenHands/software-agent-sdk/actions/runs/23972109855/job/69930673480?pr=1996 failed and fix it

openhands-ai · 2026-04-05T17:24:18Z

I'm on it! xingyaoww can track my progress at all-hands.dev

- Rename t01-t04, t07-t08 test files back from _t*.py to t*.py so the test runner discovers them again (they were temporarily disabled for debugging but never re-enabled). - Change ProcessPoolExecutor to use shutdown(wait=False) after all results are collected, preventing the runner from hanging indefinitely when browser/Chrome child processes survive worker teardown. - Clear BrowserToolSet._shared_executor class variable on close() so stale references don't prevent worker process exit. Co-authored-by: openhands <openhands@all-hands.dev>

openhands-ai · 2026-04-05T17:33:36Z

Summary of Changes

The PR comment asked me to understand the PR context, diagnose why the CI workflow failed, and fix it.

Was the request fully addressed?

Understood the PR context: PR test: use default agent preset for integration tests #1996 updates integration tests to use the default agent preset from openhands.tools.preset.default instead of manual tool registration.
Diagnosed the CI failure: All 4 model test jobs timed out at 3 hours because:
1. Tests t01–t04, t07–t08 were renamed with a _ prefix (commit c9ef6c99, "temporarily disable non-browser tests") but never re-enabled, leaving only the 2 browser tests (t05, t06) to run.
2. Both browser tests passed, but ProcessPoolExecutor.shutdown(wait=True) hung indefinitely because orphan Chrome/tmux subprocesses prevented worker process exit.
3. No test artifacts were uploaded → the consolidate-results job failed with "Results directory not found."
Fixed the issue: Pushed commit e7ca120e with three targeted changes.

Are the changes concise?

Yes, all three changes directly address the failure:

Renamed 6 test files back from _t*.py → t*.py so the test runner's [tbc]*.py glob discovers them again.
Changed run_evaluation() in run_infer.py to use executor.shutdown(wait=False, cancel_futures=True) instead of the context manager's implicit shutdown(wait=True), preventing the process hang after all results are collected.
Added cleanup in BrowserToolExecutor.close() to reset BrowserToolSet._shared_executor = None, releasing the stale class-level reference that could prevent worker process exit.

No extraneous changes were made.

github-actions · 2026-04-05T17:39:41Z

Coverage Report •

File	Stmts	Miss	Cover	Missing
openhands-tools/openhands/tools/browser_use
impl.py	259	158	39%	57–61, 63–64, 66, 68, 70–73, 75–76, 78, 80, 87, 99–102, 108–109, 114, 116–118, 120–121, 129–131, 133–137, 142, 193, 198–201, 203, 225–227, 230–232, 234, 247, 284–285, 289, 299, 314–315, 320, 334–335, 340–341, 351, 369–370, 372–385, 388–401, 403–404, 410, 415–418, 426, 428, 431–432, 438–439, 444–445, 451–452, 456–457, 461–462, 466, 468–469, 471–474, 477–478, 484, 486, 488, 496–497, 501–502, 507–508, 512–513, 517–518, 523–524, 536–537, 548–549, 553–557, 571–573, 578, 583–584, 594, 600–601
TOTAL	22011	11139	49%

openhands-ai bot mentioned this pull request Feb 11, 2026

Test: use default agent preset for integration tests #372

Open

all-hands-bot reviewed Feb 11, 2026

View reviewed changes

tests/integration/base.py Show resolved Hide resolved

tests/integration/base.py Outdated Show resolved Hide resolved

tests/integration/tests/t05_simple_browsing.py Show resolved Hide resolved

tests/integration/utils/behavior_helpers.py Outdated Show resolved Hide resolved

xingyaoww and others added 2 commits February 11, 2026 16:21

Merge branch 'main' into openhands/use-default-preset-for-integration…

eff042e

…-tests

Merge branch 'main' into openhands/use-default-preset-for-integration…

0c1c46a

…-tests

enyst approved these changes Feb 11, 2026

View reviewed changes

xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 11, 2026

Merge branch 'main' into openhands/use-default-preset-for-integration…

fdf88a0

…-tests

xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Feb 13, 2026

enyst added behavior-initiative This is related to the system prompt sections and LLM steering. and removed behavior-initiative This is related to the system prompt sections and LLM steering. labels Feb 14, 2026

xingyaoww removed the integration-test Runs the integration tests and comments the results label Feb 15, 2026

xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 15, 2026 — with OpenHands AI

test: temporarily disable non-browser tests to debug browser test hang

c9ef6c9

Co-authored-by: openhands <openhands@all-hands.dev>

xingyaoww removed the integration-test Runs the integration tests and comments the results label Feb 15, 2026

xingyaoww added the integration-test Runs the integration tests and comments the results label Feb 15, 2026 — with OpenHands AI

OpenHands deleted a comment from all-hands-bot Feb 27, 2026

enyst added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Feb 27, 2026

xingyaoww added the integration-test Runs the integration tests and comments the results label Mar 6, 2026

xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Mar 6, 2026

Merge branch 'main' into openhands/use-default-preset-for-integration…

1561e47

…-tests

xingyaoww added integration-test Runs the integration tests and comments the results and removed integration-test Runs the integration tests and comments the results labels Apr 4, 2026

Conversation

xingyaoww commented Feb 11, 2026 • edited by github-actions bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes:

Net change: -85 lines of code

Checklist

Uh oh!

all-hands-bot left a comment

Choose a reason for hiding this comment

🟡 Acceptable - Good Refactoring with Minor Design Smell

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

openhands-ai bot commented Feb 11, 2026

Uh oh!

enyst left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot commented Feb 11, 2026

Uh oh!

github-actions bot commented Feb 13, 2026

Uh oh!

xingyaoww commented Feb 15, 2026

Uh oh!

openhands-ai bot commented Feb 15, 2026

Uh oh!

openhands-ai bot commented Feb 15, 2026

Analysis: Why the Integration Test Never Finishes

Root Cause

Evidence from Logs

Recommended Fix

Uh oh!

github-actions bot commented Feb 15, 2026

Uh oh!

github-actions bot commented Feb 15, 2026

Uh oh!

github-actions bot commented Feb 27, 2026

Uh oh!

github-actions bot commented Mar 6, 2026

Uh oh!

xingyaoww commented Mar 6, 2026

Uh oh!

openhands-ai bot commented Mar 6, 2026

Uh oh!

openhands-ai bot commented Mar 6, 2026

Summary

Uh oh!

github-actions bot commented Mar 6, 2026

Uh oh!

xingyaoww commented Mar 6, 2026

Uh oh!

openhands-ai bot commented Mar 6, 2026

Uh oh!

openhands-ai bot commented Mar 6, 2026

Summary

Problem

Root Cause

Fix (commit f5a38f12)

Checklist

Uh oh!

github-actions bot commented Mar 6, 2026

Uh oh!

all-hands-bot commented Mar 12, 2026

Uh oh!

all-hands-bot commented Mar 17, 2026

Uh oh!

all-hands-bot commented Mar 23, 2026

Uh oh!

all-hands-bot commented Mar 29, 2026

Uh oh!

all-hands-bot commented Apr 3, 2026

Uh oh!

github-actions bot commented Apr 4, 2026

Uh oh!

xingyaoww commented Apr 5, 2026

Uh oh!

openhands-ai bot commented Apr 5, 2026

xingyaoww commented Feb 11, 2026 •

edited by github-actions bot

Loading

Fix (commit `f5a38f12`)